menu_book CONTEXT ENGINEERING · DEEP DIVE

Context Rot: why more context can make your model worse.

Bigger context windows feel like free memory. In practice, model accuracy decays as the input grows — even on trivial tasks, and long before the window is full. Here's the evidence, the mechanisms, and what an FDE should do about it.

schedule9 min read link8 cited sources boltPairs with Module 2

Primary source: Hong, Troynikov & Huber, “Context Rot: How Increasing Input Tokens Impacts LLM Performance,” Chroma (July 2025). Charts below are original visualizations built for this lesson, recreating the trends reported in the cited research.

Accuracy falls as input grows — for every model

Same task, only the amount of surrounding text changes. All four model families degrade, at different rates.

Claude-class Gemini-class GPT-class Qwen-class

Figure 1. Original illustration of trends reported in Chroma, Context Rot (2025). Curves are schematic, not the report's exact measurements.

The uniform-context myth

We tend to assume a model treats its 10,000th token as reliably as its 100th. Chroma's evaluation of 18 frontier models — including GPT-4.1, Claude 4, Gemini 2.5 and Qwen3 — shows that assumption is false: models do not use their context uniformly, and reliability erodes as input length grows, even when the task itself stays trivially simple.

The popular Needle-in-a-Haystack test made long context look solved — but it only measures literal keyword retrieval. Once you require semantic matching, add distractors, or scale output alongside input, performance slips in surprising, non-uniform ways.

⚡ Key takeaways

check_circleDegradation is gradual and starts long before the window is full — a 1M-token model still rots at 50K.
check_circleThe enemy is noise accumulation, not capacity. A bigger window is just more room for irrelevant tokens.
check_circleWhat you put in context, and how it's arranged, matters as much as whether the answer is present at all.

Lost in the middle

The earliest and most cited symptom: position matters. Liu et al. (Stanford, 2023) found accuracy traces a U-shape — models recall facts placed at the very start or end of the context far more reliably than facts buried in the middle, a gap that can exceed 30 percentage points.

Accuracy vs. position of the relevant fact

Figure 2. Original illustration of the “lost in the middle” effect from Liu et al. (2023).

A likely architectural cause is the long-range decay built into rotary position embeddings: distant token pairs get systematically lower attention, and softmax sharpens the bias toward the start and end of the sequence.

It isn't just position — it's similarity

Real queries rarely share exact keywords with the answer. When Chroma varied the semantic similarity between the question and the “needle,” low-similarity pairs degraded much faster as input grew — the model has to infer relevance instead of pattern-matching it. NoLiMa (Modarressi et al., 2025) reports the same: drop literal overlap and long-context scores collapse.

High vs. low question–answer similarity

High similarity Low similarity

Figure 3. Original illustration of needle–question similarity trends from Chroma (2025) and NoLiMa (2025).

Distractors compound the damage

Add text that's topically related but doesn't answer the question, and accuracy drops further. Chroma found a single distractor already hurts, and four compound the effect — amplified at longer inputs. Notably, model families differ: Claude tends to abstain under ambiguity, while GPT models more often hallucinate a confident wrong answer. This echoes Shi et al. (2023), “LLMs Can Be Easily Distracted by Irrelevant Context.”

Accuracy by number of distractors

Figure 4. Original illustration of the distractor effect reported in Chroma (2025).

Retrieval + reasoning: focused beats full

Dumping a whole chat history into the prompt forces the model to do two jobs at once — find the relevant parts and reason over them. On LongMemEval (Wu et al., 2025), Chroma compared a ~113K-token “full” prompt against a ~300-token “focused” prompt containing only the relevant turns. Every family scored far higher on the focused input.

Focused prompt vs. full prompt

Focused (~300 tokens) Full (~113K tokens)

Figure 5. Original illustration of LongMemEval focused-vs-full results from Chroma (2025); benchmark by Wu et al. (2025).

Structure, absence, and hard thresholds

Three more findings round out the picture. Counter-intuitively, Chroma found models score better on a randomly shuffled haystack than on a logically coherent one — structure changes how attention is spent. Separately, AbsenceBench (Fu et al., 2025) shows models struggle to notice what's missing from a long input. And recent threshold analysis (2026) reports some models collapse abruptly — a >40% F1 drop — once a critical fraction of the window is crossed, rather than degrading smoothly.

“Whether the answer is in the context isn't what matters most — what matters is how that information is presented.”

— Chroma, Context Rot (2025)

What this means for an FDE

The fix isn't a bigger window — it's context engineering: deliberately curating what enters the prompt.

filter_alt

Retrieve, then reason

Use RAG to pull only the relevant chunks instead of pasting everything. Focused prompts consistently win.

low_priority

Place key facts at the edges

Put the most important instructions and evidence at the start or end — not buried mid-prompt.

cleaning_services

Prune distractors & stale turns

In agents, compact or summarize old tool output. Every leftover result is a future distractor.

straighten

Measure on YOUR length

Benchmark at the input sizes you actually run. A clean NIAH score says little about your real workload.

References & sources

Hong, K., Troynikov, A., & Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma. trychroma.com/research/context-rot
Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172
Modarressi, A., et al. (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. arXiv:2502.05167
Wu, D., et al. (2025). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv:2410.10813
Shi, F., et al. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context. arXiv:2302.00093
Fu, H. Y., et al. (2025). AbsenceBench: Language Models Can't Tell What's Missing. arXiv:2506.11440
Hsieh, C. P., et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv:2404.06654
Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination (2026). arXiv:2601.15300

Up next · Module 2

Gemini API & Context Windows

arrow_forward